Jun SATO Tsutomu KIMURA Masaharu IMAI Frank de SCHEPPER Kazuo YAMAZAKI Masashi NAGASE Shin-ichiro YAMAMOTO
This letter describes the architecture and ASIC implementation of the FSP-3 (Flexible Servo motor control Processor-3) chip. The FSP-3 is a special-purpose 32-bit microprocessor dedicated to the Flexible Servo Control System (FSC), which can control various kinds of servo motors efficiently. The FSP-3 chip is one of the largest-scale system ASICs designed entirely in Japanese universities.
Alauddin Y. ALOMARY Masaharu IMAI Nobuyuki HIKICHI
One of the most interesting and most analyzed aspects of CPU design is instruction set design. How many and which operations should be provided by hardware is one of the most fundamental issues relating to instruction set design. This paper describes a novel method that formulates the instruction set design of an ASIP (Application Specific Integrated Processor) using a combinatorial approach. Starting with the whole set of candidate instructions that represent a given application domain, this approach selects a subset that maximizes performance under the constraints of chip area, power consumption, and the functional module sharing relation among operations. This leads to an efficient implementation of the selected instructions. A branch-and-bound algorithm is used to solve this combinatorial optimization problem. The approach selects the most important instructions for a given application and also optimizes the hardware resources that implement the selected instructions. It also enables designers to predict the performance of their designs before implementing them, which is an important feature for producing a quality design in reasonable time.
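The selection step described above can be illustrated with a minimal sketch. It is not the paper's formulation: it assumes a single area constraint only (power consumption and functional module sharing are ignored), and the function name `select_instructions`, the candidate tuples, and the numbers in the usage note are all hypothetical. The bound is the usual fractional relaxation used for knapsack-style branch-and-bound.

```python
def select_instructions(candidates, area_budget):
    """Pick the subset of candidate instructions with the largest total gain.

    candidates: list of (name, cycles_saved, area) tuples with area > 0.
    Returns (best_total_gain, sorted list of chosen names).
    Assumption: only an area budget is modelled, not power or module sharing.
    """
    # Sort by gain per unit area so the fractional bound below is valid.
    items = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    best = [0, []]  # best gain found so far and the subset that achieves it

    def bound(i, gain, area_left):
        # Optimistic estimate: fill the remaining area greedily and allow
        # one fractional item. No feasible completion can exceed this value.
        for _, g, a in items[i:]:
            if a <= area_left:
                gain += g
                area_left -= a
            else:
                return gain + g * area_left / a
        return gain

    def search(i, gain, area_left, chosen):
        if gain > best[0]:
            best[0], best[1] = gain, list(chosen)
        # Prune when no items remain or the bound cannot beat the incumbent.
        if i == len(items) or bound(i, gain, area_left) <= best[0]:
            return
        name, g, a = items[i]
        if a <= area_left:  # branch 1: implement this instruction in hardware
            chosen.append(name)
            search(i + 1, gain + g, area_left - a, chosen)
            chosen.pop()
        search(i + 1, gain, area_left, chosen)  # branch 2: leave it out

    search(0, 0, area_budget, [])
    return best[0], sorted(best[1])
```

With the hypothetical candidates `[("mul", 60, 10), ("div", 100, 20), ("mac", 120, 30)]` and an area budget of 50, the search returns `(220, ["div", "mac"])`: the two larger units together save more cycles than any combination that includes `mul`.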
Takuji HIEDA Hiroaki TANAKA Keishi SAKANUSHI Yoshinori TAKEUCHI Masaharu IMAI
Partial forwarding is a design method that places forwarding paths on only part of a processor pipeline. With partial forwarding, the hardware cost of a processor can be reduced without performance loss. However, this requires a compiler whose instruction scheduler takes the partial forwarding structure of the target processor into account, since conventional scheduling algorithms cannot make the most of a partial forwarding structure. In this paper, we propose a heuristic instruction scheduling method for processors with a partial forwarding structure. The proposed algorithm uses the available distance to schedule instructions in a way that suits the target partial forwarding processor. Experimental results show that the proposed method generates near-optimal solutions in practical time and that some of the codes optimized for a partial forwarding processor run in the shortest time among the target processors. The results also show that the proposed method is superior to a hazard detection unit.
Nguyen Ngoc BINH Masaharu IMAI Yoshinori TAKEUCHI
In designing ASIPs (Application Specific Integrated Processors), the work reported so far has mostly focused on optimizing the CPU core and has not paid enough attention to optimizing the RAM and ROM sizes together with it. This paper overcomes this limitation and proposes an optimization algorithm that determines the best ratio between the CPU core, RAM and ROM of an ASIP chip to achieve the highest performance while satisfying design constraints on the chip area. The partitioning problem is formalized as a combinatorial optimization problem that partitions the operations into hardware and software so that the performance of the designed ASIP is maximized under a given chip area constraint, where the chip area includes the HW cost of the register file for a given application program with an associated input data set. The optimization problem is parameterized so that it can be applied with different technologies to synthesize CPU cores, RAMs or ROMs. The experimental results show that the proposed algorithm is effective and efficient.
Masaharu IMAI Hitoshi KITAZAWA
Nguyen Ngoc BINH Masaharu IMAI Akichika SHIOMI Nobuyuki HIKICHI
This paper proposes a new method to design an optimal pipelined instruction set processor for ASIP development using a formal HW/SW codesign methodology. First, a HW/SW partitioning algorithm for selecting an optimal pipelined architecture is outlined. Then, an adaptive database approach is presented that enhances the optimality of the design through very accurate estimation of the performance of a pipelined ASIP in the HW/SW partitioning process. The experimental results show that the proposed method is effective and efficient.
Makiko ITOH Yoshinori TAKEUCHI Masaharu IMAI Akichika SHIOMI
A synthesizable HDL generation method for pipelined processors is proposed. With the proposed method, the data-path and control logic descriptions of a target processor are generated from a clock-based instruction set specification. The experimental results demonstrate the feasibility of the proposed method and show that processor design time is drastically reduced compared with conventional manual RT-level design in HDL.
In this paper, a combinatorial problem oriented multicomputer system called DON (Double-Tree Structured Network Machine) is proposed, and a parallel branch-and-bound program scheme for the DON system is described. The DON system is composed of two binary-tree structured subsystems and a system controller. The DON system works as a post-end processor of a host computer system. It is designed to achieve high parallelism and efficient pipeline ability. One of the most distinctive features of the DON system, compared to a conventional single-tree machine, is that algorithms with pipeline features can be easily implemented and executed more efficiently. Experimental results obtained through simulation indicate that the DON system can solve large-scale combinatorial problems more efficiently than a conventional single-tree machine.
Takumi NAKANO Yoshiki KOMATSUDAIRA Akichika SHIOMI Masaharu IMAI
In a real-time system, it is required to reduce the response time to an interrupt signal, as well as the execution time of a Real-Time Operating System (RTOS). In order to satisfy this requirement, we have proposed a method of implementing some of the functionalities of an RTOS using hardware. Based on this idea, we have implemented a VLSI chip, called STRON (silicon TRON: The Realtime Operating system Nucleus), to enhance the performance of an RTOS, where the STRON chip works as a peripheral unit of any MPU. In this paper we describe the hardware architecture of the STRON chip and the performance evaluation results of the RTOS using the STRON chip. The following results were obtained. (1) The STRON chip is implemented in only about 10,000 gates when the number of each object (task, event flag, semaphore, and interrupt) is 7. (2) The task scheduler can execute within 8 clocks in a fixed period using the hardware algorithm when the number of tasks is 7. (3) Most of the basic µITRON system calls using the STRON chip can be executed in a fixed period of a few microseconds. (4) The execution time of a system call, measured by a multitask application program model, can be reduced to about one-fifth of that of a conventional software RTOS. (5) The total performance, including context switching, is about 2.2 times faster than that of the software RTOS. We conclude that the execution time of the part of a system call implemented by the STRON chip can almost be ignored, but the interface software and the context switching, which depend on the architecture of the MPU, strongly influence the total performance of an RTOS.
Shinsuke KOBAYASHI Kentaro MITA Yoshinori TAKEUCHI Masaharu IMAI
This paper proposes a compiler generation method for PEAS-III (Practical Environment for ASIP development), which is a configurable processor development environment for application domain specific embedded systems. Using the PEAS-III system, not only the HDL description of a target processor but also its target compiler can be generated. Therefore, execution cycles and dynamic power consumption can be rapidly evaluated. Two processors and their derivatives were designed using the PEAS-III system in the experiment. Experimental results show that the trade-offs among area, performance and power consumption of processors were analyzed in about twelve hours and the optimal processor was selected under the design constraints by using generated compilers and processors.
Katsuya SHINOHARA Norimasa OHTSUKI Yoshinori TAKEUCHI Masaharu IMAI
This paper proposes an ASIP performance optimization method that takes clock frequency into account. The performance of an instruction set processor can be measured by the execution time of an application program, which is the number of clock cycles needed to run the program divided by the applied clock frequency. Therefore, the clock frequency should also be tuned in order to maximize the performance of the processor under the given design constraints. Experimental results show that the proposed method determines an optimal combination of functional units (FUs) considering clock frequency.
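The relation stated above (execution time equals clock cycles divided by clock frequency) can be made concrete with a small sketch. It only illustrates why cycle count alone is not a sufficient metric; it is not the optimization method of the paper. The configuration dictionaries, their field names, and all numbers are hypothetical.

```python
def execution_time(cycles, clock_hz):
    """Execution time in seconds: clock cycles divided by clock frequency."""
    return cycles / clock_hz


def best_configuration(configs, area_limit):
    """Return the feasible configuration with the shortest execution time.

    configs: list of dicts with the hypothetical keys
             'name', 'cycles', 'clock_hz' and 'area'.
    Returns None when no configuration fits within area_limit.
    """
    feasible = [c for c in configs if c["area"] <= area_limit]
    if not feasible:
        return None
    return min(feasible, key=lambda c: execution_time(c["cycles"], c["clock_hz"]))


# Hypothetical example: B needs fewer cycles, but its extra functional unit
# lengthens the critical path and lowers the achievable clock frequency.
configs = [
    {"name": "A", "cycles": 1_000_000, "clock_hz": 100e6, "area": 40},
    {"name": "B", "cycles": 800_000, "clock_hz": 50e6, "area": 45},
]
chosen = best_configuration(configs, area_limit=50)
```

In this made-up case configuration A takes 10 ms and configuration B takes 16 ms, so A is selected even though B executes fewer cycles.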
Masaharu IMAI Yoshio SUGIZAKI Koichi ASATANI
Real-time Internet applications are growing rapidly, and available bandwidth estimation is required. Methods for estimating available bandwidth at the end host, such as Pathload and pathChirp, have been studied. These methods parameterize the probe packet volume and observe the delay variation to estimate available bandwidth. In these methods, the probe packets impose a heavy overhead load on the network. In this paper, we propose a new available bandwidth estimation method based on the frequency of the minimum RTT of probe packets in multi-hop links. This method estimates the bandwidth utilization and available bandwidth of a bottleneck link without significantly increasing network overhead. The accuracy of the available bandwidth estimates is evaluated by implementing the proposed method. The proposed method shows better performance than pathChirp or Pathload, requiring both fewer probe packets and less estimation time.
Yuuka HIRAO Yoshinori TAKEUCHI Masaharu IMAI Jaehoon YU
Heart disease is one of the major causes of death in many advanced countries. For prevention or treatment of heart disease, an early diagnosis based on long-term electrocardiogram (ECG) examination is necessary. However, analyzing such a large amount of data can be a heavy burden on medical experts. To reduce the burden and support the analysis, this paper proposes an arrhythmia detection method based on a deformable part model, which absorbs individual variation in ECG waveforms and enables the detection of various arrhythmias. Moreover, to detect arrhythmia with low processing delay, the proposed method utilizes only time-domain features. In experiments, the proposed method achieved an F-measure of 0.91 for arrhythmia detection.
Masaharu IMAI Yuuji YOSHIDA Teruo FUKUMURA
The amount of memory space required by a branch-and-bound algorithm depends on the search strategy used in the algorithm. From the viewpoint of implementing branch-and-bound algorithms, it is desirable that the amount of memory space can be bounded to some feasible size. In this paper, we propose two new search strategies for branch-and-bound algorithms, by which the amount of required memory space is controllable. These strategies are named "pdfs (parallel depth-first search)" and "blis (breadth limited search)", respectively. One of the main results of this paper is that (a) the amount of memory space required by either of these strategies is a linear function of the size of the given problem and (b) the amount of required memory space is controllable by adjusting an appropriate parameter. That is, these search strategies are adaptable to the available memory space. Another result of this paper is that the computational performance of a branch-and-bound algorithm using either of these strategies can be improved by adjusting appropriate parameters.
Hideki YAMAUCHI Yoshinori TAKEUCHI Masaharu IMAI
This paper proposes an efficient architecture for fractal image coding processors. The proposed architecture achieves high-speed image coding comparable to conventional JPEG processing. It performs fractal image compression coding of a 512 × 512 pixel image in less than 33.3 msec and thus enables full-motion fractal image coding. The circuit size of the proposed architecture design is comparable to those of JPEG processors and much smaller than those of previously proposed fractal processors.
Nguyen Ngoc BINH Masaharu IMAI Akichika SHIOMI Nobuyuki HIKICHI Yoshimichi HONMA Jun SATO
In this paper we describe the formal conditions to detect and resolve all kinds of pipeline data hazards and propose a scheduling algorithm for pipelined instruction set processor synthesis. The algorithm deals with multi-cycle operations and tries to minimize the pipeline execution cycles under a given hardware configuration with or without hardware interlock. The main feature that distinguishes the proposed algorithm from existing ones is that it is intended for estimating performance in HW/SW partitioning, with the capability of handling a module library of different FUs and dealing with multi-cycle operations to be implemented in software. Experimental results of application to ASIP HW/SW codesign show that the proposed algorithm is effective and that considerable reductions in pipeline execution cycles can be achieved. The time complexity of the scheduling algorithm is O(n^2) in the worst case, where n is the number of instructions in a given basic block.
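As background for the hazard detection mentioned above, the sketch below classifies the data hazards between two instructions using the standard textbook definitions (RAW, WAR, WAW). It is not the formal conditions of the paper, which also cover multi-cycle operations and interlock; the function name, the dictionary layout, and the register names are hypothetical.

```python
def data_hazards(first, second):
    """Classify data hazards between two instructions in program order.

    first, second: dicts with 'reads' and 'writes' sets of register names,
                   where `first` precedes `second`.
    Returns the subset of {"RAW", "WAR", "WAW"} that applies.
    """
    hazards = set()
    if first["writes"] & second["reads"]:
        hazards.add("RAW")  # second reads a value that first produces
    if first["reads"] & second["writes"]:
        hazards.add("WAR")  # second overwrites a value that first still reads
    if first["writes"] & second["writes"]:
        hazards.add("WAW")  # both write the same register
    return hazards


# Hypothetical pair: add r1, r2, r3 followed by sub r4, r1, r5
add = {"reads": {"r2", "r3"}, "writes": {"r1"}}
sub = {"reads": {"r1", "r5"}, "writes": {"r4"}}
found = data_hazards(add, sub)
```

Here `found` contains only `"RAW"`, because `sub` reads `r1` before `add` is guaranteed to have written it back. Checking every ordered pair of instructions in a basic block of n instructions takes O(n^2) comparisons.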
Yuki KOBAYASHI Murali JAYAPALA Praveen RAGHAVAN Francky CATTHOOR Masaharu IMAI
Clustering L0 buffers is effective for energy reduction in the instruction memory caches of embedded VLIW processors. However, the efficiency of the clustering depends on the schedule of the target application. To improve the energy efficiency of L0 clusters, operation shuffling has been proposed, which explores the assignment of operations for each cycle, generates various schedules, and evaluates them to find an energy-efficient schedule. This approach can find energy-efficient schedules, but it takes a long time to obtain the final result. In this paper, we propose a new method to directly generate an energy-efficient schedule without iterations of operation shuffling. In the proposed method, a compiler schedules operations using the result of a single operation shuffling as a constraint. We propose some optimization algorithms to generate an energy-efficient schedule for a given L0 cluster organization. The proposed method can drastically reduce the computational effort since it performs the operation shuffling only once. The experimental results show that the proposed method achieves comparable energy reduction with significantly less computational effort than conventional operation shuffling.
Hiroaki TANAKA Yoshinori TAKEUCHI Keishi SAKANUSHI Masaharu IMAI Hiroki TAGAWA Yutaka OTA Nobu MATSUMOTO
SIMD instructions are often implemented in modern multimedia-oriented processors. Although SIMD instructions are useful for many digital signal processing applications, most compilers do not exploit them. The difficulty in utilizing SIMD instructions stems from data parallelism in registers. In assembly code generation, the positions of data in registers must be tracked. A technique for generating pack instructions, which pack or reorder data in registers, is essential for exploiting SIMD instructions. This paper presents a code generation technique for SIMD instructions with pack instructions. SIMD instructions are generated by finding and grouping the same operations in programs. After the SIMD instruction generation, pack instructions are generated. In the pack instruction generation, a Multi-valued Decision Diagram (MDD) is introduced to represent and manipulate sets of packed data. Experimental results show that the proposed code generation technique can generate assembly code with SIMD and pack instructions performing repacking of 8 packed data in registers for a RISC processor with a dual-issue coprocessor which supports SIMD and pack instructions. The proposed method achieved a speedup ratio of up to about 8.5 through SIMD instructions and the multiple-issue mechanism of the target processor.
Ittetsu TANIGUCHI Ayataka KOBAYASHI Keishi SAKANUSHI Yoshinori TAKEUCHI Masaharu IMAI
Forward error correction (FEC) is one of the important and computationally heavy tasks in wireless communication. Leading-edge mobile embedded systems usually support not just one FEC standard but multiple FEC standards in order to adapt to various wireless communication standards. In this paper, we propose a two-stage configurable decoder model (2-Stage CDM) for multiple FEC standards for Viterbi and Turbo coding, which vary in constraint length, coding rate, and other parameters. The proposed decoder model realizes a decoder instance dedicated to a chosen set of FEC standards and enables rapid design of domain-specific decoders. The proposed decoder model is configurable in two stages, at hardware generation time and at runtime, and designers can easily specify these configurations through various design parameters. Experimental results show that the proposed two-stage configurable decoder model supports various domain-specific FEC decoders, including an existing decoder, and that the decoder instances based on the proposed 2-Stage CDM have sufficient throughput for each communication standard and a reasonable area overhead compared with the existing decoder.